implemented L1 distance & fixed L2 in binary quantization #21
Conversation
I just realized the changes in the C implementations broke the qdrant Windows build; I'm trying to understand why cl.exe can't build them.
I had seen the `l1_avx` is not used warning, but it is being used in the benchmarks. I could of course remove the benchmarks and the implementation, but do you have a better idea?
I have realized that this L1 distance is currently not compatible with binary quantization. I mostly followed what had been done for L2, but I suspect L2 is also incompatible with it. Edit: On second thought, I think it may already be compatible, as the current code uses XOR when invert = true, and the truth tables match perfectly (it is the actual Hamming distance in the case of L1).
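The XOR observation above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code; it assumes vectors are packed into `u64` words after binary quantization, and `l1_binary` is an invented name:

```rust
// Hypothetical sketch: for vectors quantized to single bits, the L1
// (Manhattan) distance reduces to the Hamming distance, which can be
// computed as popcount(a XOR b) on the packed bit representation.
fn l1_binary(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    // 0b1011 vs 0b0010 differ in bits 0 and 3, so the L1 distance is 2.
    let a = 0b1011u64;
    let b = 0b0010u64;
    assert_eq!(l1_binary(a, b), 2);

    // Cross-check against the unpacked sum of per-bit |a_i - b_i|.
    let sum: u32 = (0..64).map(|i| (((a >> i) ^ (b >> i)) & 1) as u32).sum();
    assert_eq!(l1_binary(a, b), sum);
    println!("ok");
}
```

This is also why the same kernel can serve both L1 and L2 under binary quantization: on 0/1 components, |a_i - b_i| and (a_i - b_i)² are identical.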
I have fixed both L1 and L2 distances and added tests for both in case of binary quantization.
I have run a benchmark with real data using Euclidean distance + binary quantization (using qdrant itself). It did not work well. I need to investigate and understand the difference between the toy test data and the real data. I'll also check L1 tomorrow. Meanwhile, I have set the status back to draft.
Did a first review pass with some minor comments. Thank you very much for your time and effort to implement this! We'll definitely take a proper look at this.
I have been banging my head against this all day. I am pretty sure the L1 and L2 distances and their inverses (similarities) are calculated properly. I have gone through the unit tests, added integration tests that use binary and scalar quantization on qdrant, etc.; they all work as I expect. Nevertheless, when I run our benchmark image vector dataset against this implementation, I still get 0% accuracy with L1 and L2 using binary quantization, while dot product gets about 84%. I could let it go if I could justify it as "our problem is just not compatible with binary quantization", but the fact that dot product works annoys me to no end. I am open to suggestions.
OK, I had an epiphany. I have relaxed the L1 and L2 binary tests to check only the similarity order, not the actual score values. This way I could see that the previous dot-product-based implementation was still on point for both L1 and L2; it was just the inverse of what it was supposed to be. The many layers of inversion confused me too. I apologize to the reviewers for having to go through my learning experience with me. :) I will run the large benchmark with this latest version as soon as I can build a Docker image. Edit: aaaand we have a winner! 🍾
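An order-only test of the kind described above might look like this sketch. The data, the helper names (`exact_l1`, `rank_by`), and the use of a negated distance as a stand-in for a quantized similarity score are all illustrative assumptions, not the PR's actual tests:

```rust
// Exact L1 (Manhattan) distance between two float vectors.
fn exact_l1(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

// Return point indices sorted ascending by the given score function.
fn rank_by<F: Fn(&[f32]) -> f32>(points: &[Vec<f32>], score: F) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..points.len()).collect();
    idx.sort_by(|&i, &j| score(&points[i]).partial_cmp(&score(&points[j])).unwrap());
    idx
}

fn main() {
    let query = [0.0f32, 0.0];
    let points = vec![vec![5.0, 5.0], vec![1.0, 1.0], vec![3.0, 0.0]];

    // Ranking by exact L1 distance from the query: distances are 10, 2, 3.
    let dist_order = rank_by(&points, |p| exact_l1(&query, p));
    assert_eq!(dist_order, vec![1, 2, 0]);

    // A similarity is an order-reversing transform of the distance; an
    // order-only test asserts the rankings agree without comparing scores.
    let mut sim_order = rank_by(&points, |p| -exact_l1(&query, p));
    sim_order.reverse();
    assert_eq!(dist_order, sim_order);
    println!("order by L1: {:?}", dist_order);
}
```

Checking only the induced ordering makes the test robust to monotone rescaling of the quantized scores while still catching an accidentally inverted score, which is exactly the bug described above.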
Nice work!
I must admit that I didn't look into the C implementations. Maybe @IvanPleshkov can take a look.
It's great work! I'm really impressed, especially with the correct scalar quantization.
This pull request is related to qdrant/qdrant#3052.
It implements the L1 distance type, to be used by the matching Manhattan distance implementation in qdrant/qdrant.
I have tried to make the changes as non-disruptive as possible, but since L1 distance cannot be reduced to a dot product as easily as L2 can, some branching between the scoring implementations was unavoidable.
I have added tests and benchmarks and verified that the implementation works for the simple (scalar), AVX, SSE, and Neon code paths.
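For context on why L2 can lean on the dot product while L1 cannot: squared L2 decomposes as ||a - b||² = ||a||² + ||b||² - 2⟨a, b⟩, so an L2 scorer can reuse a dot-product kernel plus precomputed norms, whereas no analogous identity exists for a sum of absolute differences. A small sketch verifying the identity (illustrative only, not the PR's code):

```rust
// Plain dot product of two float vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 0.0, -1.0];

    // Direct squared L2: sum of (a_i - b_i)^2.
    let l2_sq_direct: f32 = a.iter().zip(&b).map(|(x, y)| (x - y) * (x - y)).sum();

    // Decomposed form: ||a||^2 + ||b||^2 - 2 * <a, b>.
    let l2_sq_via_dot = dot(&a, &a) + dot(&b, &b) - 2.0 * dot(&a, &b);

    assert!((l2_sq_direct - l2_sq_via_dot).abs() < 1e-5);
    println!("squared L2 = {}", l2_sq_direct);
}
```

Since ∑|a_i - b_i| has no such expansion into norms and an inner product, an L1 scorer needs its own accumulation loop, which is the branching mentioned above.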